Data Visualization

3.2

library(tidyverse)  # loads ggplot2 (mpg, diamonds), dplyr (filter) and tibble (tribble)

mpg
## # A tibble: 234 x 11
##    manufacturer model    displ  year   cyl trans   drv     cty   hwy fl    class
##    <chr>        <chr>    <dbl> <int> <int> <chr>   <chr> <int> <int> <chr> <chr>
##  1 audi         a4         1.8  1999     4 auto(l… f        18    29 p     comp…
##  2 audi         a4         1.8  1999     4 manual… f        21    29 p     comp…
##  3 audi         a4         2    2008     4 manual… f        20    31 p     comp…
##  4 audi         a4         2    2008     4 auto(a… f        21    30 p     comp…
##  5 audi         a4         2.8  1999     6 auto(l… f        16    26 p     comp…
##  6 audi         a4         2.8  1999     6 manual… f        18    26 p     comp…
##  7 audi         a4         3.1  2008     6 auto(a… f        18    27 p     comp…
##  8 audi         a4 quat…   1.8  1999     4 manual… 4        18    26 p     comp…
##  9 audi         a4 quat…   1.8  1999     4 auto(l… 4        16    25 p     comp…
## 10 audi         a4 quat…   2    2008     4 manual… 4        20    28 p     comp…
## # … with 224 more rows
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

ggplot(data = mpg) +
  geom_point(mapping = aes(x = hwy, y = cyl))

ggplot(data = mpg) +
  geom_point(mapping = aes(x = class, y = drv))

Graphing Template

ggplot(data = <DATA>) + 
  <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

Exercises

  1. What does the drv variable describe? drv describes whether the vehicle is front-wheel drive, rear-wheel drive, or 4-wheel drive.
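  2. What happens if you make a scatterplot of class vs drv? Why is the plot not useful? Both variables are categorical, so every car lands on one of a few grid points and the points overlap. The plot only shows which class/drv combinations exist, not how many cars fall in each.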
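
To see what the class vs drv scatterplot hides, it can help to count how many cars sit on each grid point. The sketch below is only one possible way to check this, using base R's table() and ggplot2's geom_count(); neither is part of the exercise itself, and the object names are made up for illustration.

```r
library(ggplot2)  # provides the mpg dataset

# Cross-tabulate the two categorical variables: each cell is the number
# of cars that a scatterplot would draw on top of each other.
class_by_drv <- table(mpg$class, mpg$drv)
print(class_by_drv)

# geom_count() draws one point per combination, sized by that count,
# so the overplotting becomes visible.
count_plot <- ggplot(data = mpg) +
  geom_count(mapping = aes(x = class, y = drv))
print(count_plot)
```

Cells with a count of zero correspond to grid points with no dot in the original scatterplot.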

3.3

# Adding a color aesthetic
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = class))

# Adding an alpha aesthetic, which adjusts transparency
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, alpha = class))
## Warning: Using alpha for a discrete variable is not advised.

# Adding a shape aesthetic
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, shape = class))
## Warning: The shape palette can deal with a maximum of 6 discrete values because
## more than 6 becomes difficult to discriminate; you have 7. Consider
## specifying shapes manually if you must have them.
## Warning: Removed 62 rows containing missing values (geom_point).

# Manually setting aesthetic properties
ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

Exercises

ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
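  1. What’s gone wrong with this code? Because color = "blue" is inside aes(), ggplot2 treats "blue" as a one-value variable to map rather than a color to set, so the points are not blue. To set an aesthetic manually, move it outside aes(): geom_point(mapping = aes(x = displ, y = hwy), color = "blue")
  2. Which variables in mpg are categorical? Out of 11 variables, 6 are categorical (manufacturer, model, trans, drv, fl, class; the <chr> columns in the printout above) and 5 are numeric (displ, year, cyl, cty, hwy), which ggplot2 treats as continuous.
  3. Map a continuous variable to color, size, and shape. How do these aesthetics behave differently for categorical vs. continuous variables? For a continuous variable, color uses a gradient and size scales smoothly with the value, rather than assigning a distinct color or size to each category. shape cannot be mapped to a continuous variable; ggplot2 raises an error.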
ggplot(data = mpg) +
  geom_point(mapping=aes(x = displ, y = hwy, color = cty))

  4. What happens if you map the same variable to multiple aesthetics? The plot still renders, but the information is redundant: here hwy is shown by both the y position and the color gradient, so the color adds nothing new.
ggplot(data = mpg) +
  geom_point(mapping=aes(x= displ, y = hwy, color = hwy))

  5. What does the stroke aesthetic do? What shapes does it work with? It sets the width of the border on shapes that have a border. It is most useful with shapes 21 to 25, which have a border color separate from the fill color.
ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point(shape = 21, color = "black", fill = "white", size = 5, stroke = 5)

  6. What happens if you map an aesthetic to something other than a variable name, like aes(color = displ < 5)? (Specify x and y) ggplot2 evaluates the expression for each row and maps the resulting logical value to the aesthetic, so the points are colored by whether displ < 5 is TRUE or FALSE.
ggplot(data = mpg) +
  geom_point(mapping=aes(color = displ < 5, x = hwy, y = cty))
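
The count of categorical variables in question 2 can be checked programmatically. This is a minimal sketch that assumes "categorical" means "stored as character"; that is a simplification, since integer columns such as cyl or year could arguably be treated as categorical too. The name is_categorical is made up for illustration.

```r
library(ggplot2)  # provides the mpg dataset

# TRUE for every column stored as character (assumed to be categorical)
is_categorical <- vapply(mpg, is.character, logical(1))

# Names of the categorical and the numeric columns
categorical_vars <- names(mpg)[is_categorical]
numeric_vars <- names(mpg)[!is_categorical]

print(categorical_vars)
print(numeric_vars)
```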

3.5 Facets

To facet your plot by a single variable, use facet_wrap(). The first argument of facet_wrap() should be a formula, which you create with ~ followed by a variable name (here “formula” is the name of a data structure in R, not a synonym for “equation”). The variable that you pass to facet_wrap() should be discrete.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_wrap(~ class, nrow = 2)

To facet your plot on the combination of two variables, add facet_grid() to your plot call. The first argument of facet_grid() is also a formula. This time the formula should contain two variable names separated by a ~.

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(drv ~ cyl) # facet_grid() has no nrow/ncol: rows come from the values of drv, columns from the values of cyl

If you prefer to not facet in the rows or columns dimension, use a . instead of a variable name, e.g. + facet_grid(. ~ cyl).

ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy)) +
  facet_grid(. ~ cyl)

Exercises

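  1. What happens if you facet on a continuous variable? The variable is treated as categorical, and the plot gets one panel for each distinct value. This works for year, which has only two distinct values in mpg, but a variable with many distinct values would produce one panel per value.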
ggplot(data = mpg) +
  geom_point(mapping = aes(displ, hwy)) +
  facet_grid(~year)

  2. What do the empty cells in plot with facet_grid(drv ~ cyl) mean? How do they relate to this plot? They mean the data contain no cars with that combination of drive type and number of cylinders, which is confirmed by the missing points at the corresponding intersections in the plot below.
ggplot(data = mpg) +
  geom_point(mapping = aes(drv, cyl))
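
The empty facets can also be confirmed numerically. The sketch below cross-tabulates drv and cyl with base R's table(); this is just one way to check, and the object names are made up for illustration.

```r
library(ggplot2)  # provides the mpg dataset

# Rows are drv values, columns are cyl values, matching facet_grid(drv ~ cyl)
drv_by_cyl <- table(mpg$drv, mpg$cyl)
print(drv_by_cyl)

# Combinations with a count of zero are the empty panels
empty_cells <- which(drv_by_cyl == 0, arr.ind = TRUE)
print(empty_cells)
```

Each zero in the table corresponds to one empty panel in the faceted plot and to one missing point in the drv vs cyl scatterplot.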

  3. What plots does the following code make? What does . do? Both plots show engine displacement against highway miles per gallon. The first is split into rows by drive type and the second into rows by number of cylinders. The . is a placeholder meaning “do not facet in this dimension”: drv ~ . facets by rows only, and . ~ cyl would facet by columns only.
ggplot(data = mpg) +
  geom_point(mapping=aes(displ, hwy)) +
  facet_grid(drv ~ .)

ggplot(data = mpg) +
  geom_point(mapping = aes(displ, hwy)) +
  facet_grid(cyl ~ .)

  4. Take the first plot in this section:
ggplot(data = mpg) + 
  geom_point(mapping = aes(x = displ, y = hwy)) + 
  facet_wrap(~ class, nrow = 2)

What are the advantages to using faceting instead of the colour aesthetic? What are the disadvantages? How might the balance change if you had a larger dataset? To me, the greatest advantage of faceting is that it splits up data that is noisy or crowded into a small region of the plot. It spreads similar data out much more, but it makes specific comparisons harder because the groups are no longer on the same panel. With a larger dataset, overplotting gets worse, so faceting becomes more attractive than color.

  5. What does nrow do? What does ncol do? What other options control the layout of the individual panels? Why doesn’t facet_grid() have nrow and ncol arguments? nrow and ncol set the number of rows and columns of panels, respectively. The as.table and dir arguments also affect how the panels are laid out. facet_grid() doesn’t have nrow and ncol because the numbers of unique values of the two faceting variables determine the number of rows and columns.

  6. When using facet_grid() you should usually put the variable with more unique levels in the columns. Why? Plots are usually wider than they are tall (landscape), so there is more room for extra columns than for extra rows.

3.6 Geometric Objects

To change the geom in your plot, change the geom function that you add to ggplot(). For instance, to make the plots above, you can use this code:

ggplot(data = mpg) +
  geom_point(mapping = aes(displ, hwy))

ggplot(data = mpg) + 
  geom_smooth(mapping = aes(displ, hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Every geom function in ggplot2 takes a mapping argument. However, not every aesthetic works with every geom. You could set the shape of a point, but you couldn’t set the “shape” of a line. On the other hand, you could set the linetype of a line. geom_smooth() will draw a different line, with a different linetype, for each unique value of the variable that you map to linetype.

ggplot(data = mpg) +
  geom_smooth(mapping = aes(displ, hwy, linetype = drv))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

To display multiple geoms in the same plot, add multiple geom functions to ggplot():

ggplot(data = mpg) +
  geom_point(mapping = aes(displ, hwy)) +
  geom_smooth(mapping = aes(displ, hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

This, however, introduces some duplication in our code. Imagine if you wanted to change the y-axis to display cty instead of hwy. You’d need to change the variable in two places, and you might forget to update one. You can avoid this type of repetition by passing a set of mappings to ggplot(). ggplot2 will treat these mappings as global mappings that apply to each geom in the graph. In other words, this code will produce the same plot as the previous code:

ggplot(data = mpg, mapping = aes(displ, hwy)) +
  geom_point() +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

If you place mappings in a geom function, ggplot2 will treat them as local mappings for the layer. It will use these mappings to extend or overwrite the global mappings for that layer only. This makes it possible to display different aesthetics in different layers.

ggplot(data = mpg, mapping = aes(displ, hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
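
One way to see that the color mapping stays local to the point layer is to inspect the data ggplot2 builds for each layer. This is a sketch using ggplot2's layer_data() helper; the variable names are made up for illustration.

```r
library(ggplot2)

p <- ggplot(data = mpg, mapping = aes(displ, hwy)) +
  geom_point(mapping = aes(color = class)) +  # local mapping: points only
  geom_smooth()                               # inherits only x and y

# Layer 1 (points): one colour per class value
point_colours <- unique(layer_data(p, 1)$colour)

# Layer 2 (smooth): the local colour mapping does not apply, so a single colour
smooth_colours <- unique(layer_data(p, 2)$colour)

print(length(point_colours))
print(length(smooth_colours))
```

Because the smooth layer never sees the class mapping, it also fits one line through all the data instead of one line per class.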

You can use the same idea to specify different data for each layer. Here, our smooth line displays just a subset of the mpg dataset, the subcompact cars. The local data argument in geom_smooth() overrides the global data argument in ggplot() for that layer only.

ggplot(data = mpg, mapping = aes(displ, hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth(data = filter(mpg, class == "subcompact"), se=FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Exercises

  1. What geom would you use to draw a line chart? A boxplot? A histogram? An area chart?

- line chart: geom_line()
- boxplot: geom_boxplot()
- histogram: geom_histogram()
- area chart: geom_area()

  2. Run this code in your head and predict what the output will look like. Then, run the code in R and check your predictions. It will show a scatterplot of hwy against displ with points colored by drv, plus a separate smooth line (loess, without confidence bands) for each drv value.
ggplot(data = mpg, mapping = aes(displ, hwy, color = drv)) +
  geom_point() +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

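  3. What does show.legend = FALSE do? What happens if you remove it? Why do you think I used it earlier in the chapter? show.legend = FALSE suppresses the legend for that layer; if you remove it, the legend is drawn whenever an aesthetic is mapped to a variable. It was used earlier in the chapter so that the plot would match the two plots beside it, which had no legend because no third (categorical) variable was mapped.

  4. What does the se argument to geom_smooth() do? It controls whether the confidence band is drawn around the smooth line. It is TRUE by default; se = FALSE removes the band.

  5. Will the two graphs below look different? Why/why not? No, they will look the same. Both layers use the same data and mappings in each case; the first version specifies them once globally in ggplot(), and the second repeats them in each geom.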

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_point() + 
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot() + 
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) + 
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
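
The claim that the two versions are equivalent can be checked by comparing the data ggplot2 computes for each layer. This is a minimal sketch using ggplot2's layer_data(); the names p_global and p_local are made up for illustration.

```r
library(ggplot2)

# Data and mappings given once, globally
p_global <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point() +
  geom_smooth()

# The same data and mappings repeated in every layer
p_local <- ggplot() +
  geom_point(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_smooth(data = mpg, mapping = aes(x = displ, y = hwy))

# Compare the computed data for the point layer (1) and the smooth layer (2)
same_points <- all.equal(layer_data(p_global, 1), layer_data(p_local, 1))
same_smooth <- all.equal(layer_data(p_global, 2), layer_data(p_local, 2))

print(same_points)
print(same_smooth)
```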

  6. Recreate the R code necessary to generate the following graphs.
ggplot(data = mpg, mapping = aes(displ, hwy)) +
  geom_point() +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = mpg, mapping = aes(displ, hwy)) +
  geom_point() +
  geom_smooth(mapping = aes(group = drv), se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(data = mpg, mapping = aes(displ, hwy, color = drv)) +
  geom_point() +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color= drv)) +
  geom_smooth(se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(mpg, aes(displ, hwy)) +
  geom_point(aes(color = drv)) +
  geom_smooth(aes(linetype = drv), se = FALSE)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

ggplot(mpg, aes(displ, hwy)) +
  geom_point(size = 4, color = "white") +
  geom_point(aes(color = drv))

3.7 Statistical Transformation

Consider a basic bar chart, as drawn with geom_bar(). The following chart displays the total number of diamonds in the diamonds dataset, grouped by cut. The diamonds dataset comes in ggplot2 and contains information about ~54,000 diamonds, including the price, carat, color, clarity, and cut of each diamond. The chart shows that more diamonds are available with high quality cuts than with low quality cuts.

ggplot(data = diamonds) +
  geom_bar(aes(x = cut))

You can generally use geoms and stats interchangeably. For example, you can recreate the previous plot using stat_count() instead of geom_bar():

ggplot(data = diamonds) + 
  stat_count(mapping = aes(x = cut))
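
To make the counting step behind geom_bar() visible, the computed bar heights can be pulled out of the plot and compared with a manual count. This is a sketch using ggplot2's layer_data() and base R's table(); the variable names are made up for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

bar_plot <- ggplot(data = diamonds) +
  geom_bar(mapping = aes(x = cut))

# Bar heights computed by stat_count(), the default stat of geom_bar()
bar_heights <- layer_data(bar_plot)$count

# The same counts computed by hand, in the order of the cut levels
manual_counts <- as.vector(table(diamonds$cut))

print(bar_heights)
print(manual_counts)
```

The counts are never stored in diamonds itself; the stat creates them when the plot is built.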

There are three reasons you might need to use a stat explicitly:

  1. You might want to override the default stat. In the code below, I change the stat of geom_bar() from count (the default) to identity. This lets me map the height of the bars to the raw values of a y variable. Unfortunately when people talk about bar charts casually, they might be referring to this type of bar chart, where the height of the bar is already present in the data, or the previous bar chart where the height of the bar is generated by counting rows.
demo <- tribble(
  ~cut, ~freq,
  "Fair", 1610,
  "Good", 4906,
  "Very Good", 12082,
  "Premium", 13791,
  "Ideal", 21551
)

ggplot(demo) +
  geom_bar(aes(cut, freq), stat="identity")

  2. You might want to override the default mapping from transformed variables to aesthetics. For example, you might want to display a bar chart of proportion, rather than count:
ggplot(diamonds) +
  geom_bar(aes(cut, after_stat(prop), group = 1))  # after_stat(prop) is the current spelling of ..prop.. (ggplot2 >= 3.3.0)
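
The group = 1 in the proportion chart matters because the proportion is computed within each group. The sketch below compares the computed values with and without it. It assumes ggplot2 3.3.0 or later for after_stat(), and the plot names are made up for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

# Without group = 1, every bar is its own group, so each proportion is 1
p_ungrouped <- ggplot(diamonds) +
  geom_bar(aes(cut, after_stat(prop)))

# With group = 1, all bars share one group, so proportions are shares of the total
p_grouped <- ggplot(diamonds) +
  geom_bar(aes(cut, after_stat(prop), group = 1))

prop_ungrouped <- layer_data(p_ungrouped)$prop
prop_grouped <- layer_data(p_grouped)$prop

print(prop_ungrouped)
print(prop_grouped)
```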

  3. You might want to draw greater attention to the statistical transformation in your code. For example, you might use stat_summary(), which summarises the y values for each unique x value, to draw attention to the summary that you’re computing:
ggplot(diamonds) +
  stat_summary(
    aes(cut, depth),
    # fun.min, fun.max and fun replace fun.ymin, fun.ymax and fun.y in ggplot2 >= 3.3.0
    fun.min = min,
    fun.max = max,
    fun = median
  )
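
The values that stat_summary() draws can be reproduced by hand, which makes the transformation explicit. This is a sketch that assumes ggplot2 3.3.0 or later (for the fun, fun.min and fun.max argument names) and uses base R's tapply() for the manual version; the variable names are made up for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

summary_plot <- ggplot(diamonds) +
  stat_summary(
    aes(cut, depth),
    fun.min = min,
    fun.max = max,
    fun = median
  )

# Values computed by the stat: y (median), ymin and ymax for each cut
stat_values <- layer_data(summary_plot)[, c("x", "ymin", "y", "ymax")]

# The same summaries computed by hand, one value per cut level
manual_median <- as.vector(tapply(diamonds$depth, diamonds$cut, median))
manual_min <- as.vector(tapply(diamonds$depth, diamonds$cut, min))
manual_max <- as.vector(tapply(diamonds$depth, diamonds$cut, max))

print(stat_values)
```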